Results 1 - 20 of 5,645
1.
J Mol Biol ; 436(2): 168395, 2024 01 15.
Article in English | MEDLINE | ID: mdl-38097109

ABSTRACT

In this study, we utilize Protein Residue Networks (PRNs), constructed using Local Spatial Pattern (LSP) alignment, to explore the dynamic behavior of Catabolite Activator Protein (CAP) upon the sequential binding of cAMP. We employed the Degree Centrality of these PRNs to investigate protein dynamics on a sub-nanosecond time scale, hypothesizing that it would reflect changes in CAP's entropy related to its thermal motions. We show that the binding of the first cAMP led to an increase in stability in the Cyclic-Nucleotide Binding Domain A (CNBD-A) and destabilization in CNBD-B, agreeing with previous reports explaining the negative cooperativity of cAMP binding in terms of an entropy-driven allostery. LSP-based PRNs also allow for the study of Betweenness Centrality, another graph-theoretical characteristic of PRNs, providing insights into global residue connectivity within CAP. Using this approach, we were able to correctly identify amino acids that were shown to be critical in mediating allosteric interactions in CAP. The agreement between our studies and previous experimental reports validates our method, particularly with respect to the reliability of Degree Centrality as a proxy for entropy related to protein thermal dynamics. Because LSP-based PRNs can be easily extended to include dynamics of small organic molecules, polynucleotides, or other allosteric proteins, the methods presented here mark a significant advancement in the field, positioning them as vital tools for a fast, cost-effective, and accurate analysis of entropy-driven allostery and identification of allosteric hotspots.


Subjects
Allosteric Regulation, Cyclic AMP Receptor Protein, Sequence Alignment, Cyclic AMP Receptor Protein/chemistry, Entropy, Molecular Dynamics Simulation, Protein Binding, Reproducibility of Results, Sequence Alignment/methods
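
Entry 1 relies on two graph measures, Degree Centrality and Betweenness Centrality, computed on protein residue networks. The sketch below shows how these two measures can be computed for a small unweighted, undirected graph. It is not the authors' LSP-based pipeline: the toy network and its node labels are hypothetical, and betweenness is computed with the standard Brandes algorithm, left unnormalized.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def betweenness_centrality(graph):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        stack = []                          # nodes in order of non-decreasing distance
        preds = {v: [] for v in graph}      # predecessors on shortest paths
        sigma = {v: 0 for v in graph}       # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:                        # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}     # dependency of source on each node
        while stack:                        # accumulate from the farthest nodes back
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Hypothetical toy network; real residue networks have one node per residue.
toy_network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"C"},
    "E": {"C"},
}

degrees = degree_centrality(toy_network)
betweenness = betweenness_centrality(toy_network)
```

In this toy graph, node C has both the highest degree and the highest betweenness, since every path between the A-B branch and the leaves D and E passes through it.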
2.
Science ; 381(6664): eadg7492, 2023 09 22.
Article in English | MEDLINE | ID: mdl-37733863

ABSTRACT

The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.


Subjects
Amino Acid Substitution, Disease, Missense Mutation, Proteome, Sequence Alignment, Humans, Amino Acid Substitution/genetics, Benchmarking, Conserved Sequence, Genetic Databases, Disease/genetics, Human Genome, Protein Conformation, Proteome/genetics, Sequence Alignment/methods, Machine Learning
3.
J Biol Chem ; 299(7): 104896, 2023 07.
Article in English | MEDLINE | ID: mdl-37290531

ABSTRACT

Measuring the relative effect that any two sequence positions have on each other may improve protein design or help better interpret coding variants. Current approaches use statistics and machine learning but rarely consider phylogenetic divergences which, as shown by Evolutionary Trace studies, provide insight into the functional impact of sequence perturbations. Here, we reframe covariation analyses in the Evolutionary Trace framework to measure the relative tolerance to perturbation of each residue pair during evolution. This approach (CovET) systematically accounts for phylogenetic divergences: at each divergence event, we penalize covariation patterns that belie evolutionary coupling. We find that while CovET approximates the performance of existing methods to predict individual structural contacts, it performs significantly better at finding structural clusters of coupled residues and ligand binding sites. For example, CovET found more functionally critical residues when we examined the RNA recognition motif and WW domains. It correlates better with large-scale epistasis screen data. In the dopamine D2 receptor, top CovET residue pairs recovered accurately the allosteric activation pathway characterized for Class A G protein-coupled receptors. These data suggest that CovET ranks highest the sequence position pairs that play critical functional roles through epistatic and allosteric interactions in evolutionarily relevant structure-function motifs. CovET complements current methods and may shed light on fundamental molecular mechanisms of protein structure and function.


Subjects
Molecular Evolution, Sequence Alignment, Binding Sites/genetics, Phylogeny, G-Protein-Coupled Receptors/genetics, Sequence Alignment/methods
4.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37084276

ABSTRACT

MOTIVATION: Protein sequence comparison is a fundamental element in the bioinformatics toolkit. When sequences are annotated with features such as functional domains, transmembrane domains, low complexity regions or secondary structure elements, the resulting feature architectures allow better informed comparisons. However, many existing schemes for scoring architecture similarities cannot cope with features arising from multiple annotation sources. Those that do fall short in the resolution of overlapping and redundant feature annotations. RESULTS: Here, we introduce FAS, a scoring method that integrates features from multiple annotation sources in a directed acyclic architecture graph. Redundancies are resolved as part of the architecture comparison by finding the paths through the graphs that maximize the pair-wise architecture similarity. In a large-scale evaluation on more than 10 000 human-yeast ortholog pairs, architecture similarities assessed with FAS are consistently more plausible than those obtained using e-values to resolve overlaps or leaving overlaps unresolved. Three case studies demonstrate the utility of FAS on architecture comparison tasks: benchmarking of orthology assignment software, identification of functionally diverged orthologs, and diagnosing protein architecture changes stemming from faulty gene predictions. With the help of FAS, feature architecture comparisons can now be routinely integrated into these and many other applications. AVAILABILITY AND IMPLEMENTATION: FAS is available as a Python package: https://pypi.org/project/greedyFAS/.


Subjects
Amino Acid Sequence, Proteins, Sequence Alignment, Software, Humans, Computational Biology/methods, Proteins/chemistry, Sequence Alignment/methods, Saccharomyces cerevisiae Proteins/chemistry
5.
Nucleic Acids Res ; 51(9): e53, 2023 05 22.
Article in English | MEDLINE | ID: mdl-36987885

ABSTRACT

The functions of non-coding RNAs usually depend on their 3D structures. Therefore, comparing RNA 3D structures is critical in analyzing their functions. We noticed an interesting phenomenon that two non-coding RNAs may share similar substructures when rotating their sequence order. To the best of our knowledge, no existing RNA 3D structural alignment tools can detect this type of matching. In this article, we defined the RNA 3D structure circular matching problem and developed a software tool named CircularSTAR3D to solve this problem. CircularSTAR3D first uses the conserved stacks (consecutive base pairs with similar 3D structures) in the input RNAs to identify the circular matched internal loops and multiloops. Then it performs a local extension iteratively to obtain the whole circular matched substructures. The computational experiments conducted on a non-redundant RNA structure dataset show that circular matching is ubiquitous. Furthermore, we demonstrated the utility of CircularSTAR3D by detecting the conserved substructures missed by regular alignment tools, including structural motifs and conserved structures between riboswitches and ribozymes from different classes. We anticipate CircularSTAR3D to be a valuable supplement to the existing RNA 3D structural analysis techniques.


Subjects
Nucleic Acid Conformation, RNA, Sequence Alignment, RNA Sequence Analysis, Software, Algorithms, Base Pairing, RNA/genetics, RNA/chemistry, Sequence Alignment/methods, RNA Sequence Analysis/methods
6.
Comput Math Methods Med ; 2022: 7191684, 2022.
Article in English | MEDLINE | ID: mdl-35242211

ABSTRACT

Protein-protein interactions (PPIs) play a crucial role in understanding disease pathogenesis, genetic mechanisms, drug design, and other biochemical processes; thus, the identification of PPIs is of great importance. With the rapid development of high-throughput sequencing technology, a large amount of PPI sequence data has accumulated. Researchers have designed many experimental methods to detect PPIs by using these sequence data; hence, the prediction of PPIs has become a research hotspot in proteomics. However, since traditional experimental methods are both time-consuming and costly, it is difficult to analyze and predict the massive amount of PPI data quickly and accurately. To address these issues, many computational systems employing machine learning have been applied to PPI prediction, thereby improving the overall recognition rate. In this paper, a novel and efficient computational technology is presented to implement a protein interaction prediction system using only protein sequence information. First, the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was employed to generate a position-specific scoring matrix (PSSM) containing protein evolutionary information from the initial protein sequence. Second, we used a novel feature representation scheme, MatFLDA, to extract the essential information of the PSSM for protein sequences and obtained five training and five testing datasets by adopting five-fold cross-validation. Finally, the random ferns (RFs) classifier was employed to infer the interactions among proteins, and a model called MatFLDA_RFs was developed. The proposed MatFLDA_RFs model achieved good prediction performance, with 95.03% average accuracy on the Yeast dataset and 85.35% average accuracy on the H. pylori dataset, outperforming other existing computational methods. The experimental results indicate that the proposed method is capable of yielding better prediction results for PPIs, which provides an effective tool for the detection of new PPIs and the in-depth study of proteomics. Finally, we also developed a web server for the proposed model to predict protein-protein interactions, which is freely accessible online at http://120.77.11.78:5001/webserver/MatFLDA_RFs.


Subjects
Protein Interaction Mapping/methods, Protein Interaction Maps/genetics, Amino Acid Sequence, Bacterial Proteins/genetics, Computational Biology, Protein Databases/statistics & numerical data, Discriminant Analysis, Molecular Evolution, Helicobacter pylori/genetics, High-Throughput Nucleotide Sequencing/statistics & numerical data, Humans, Machine Learning, Position-Specific Scoring Matrices, Protein Interaction Mapping/statistics & numerical data, Saccharomyces cerevisiae/genetics, Saccharomyces cerevisiae Proteins/genetics, Sequence Alignment/methods, Sequence Alignment/statistics & numerical data, Support Vector Machine
7.
Comput Math Methods Med ; 2022: 8691646, 2022.
Article in English | MEDLINE | ID: mdl-35126641

ABSTRACT

Task scheduling in parallel multiple sequence alignment (MSA) through improved dynamic programming optimization speeds up alignment processing. The growing importance of matching multiple sequences also calls for the use of parallel processor systems. This dynamic algorithm proposes improved task scheduling for parallel MSA. Specifically, the alignment of several tertiary-structured proteins is computationally more complex than simple word-based MSA. Parallel task processing is computationally more efficient for protein structure-based superposition. The basic condition for applying dynamic programming is also fulfilled, because the task scheduling problem has multiple possible solutions or options. Search space reduction for speedy processing of this algorithm is carried out through a greedy strategy. Performance in terms of better results is ensured through computationally expensive recursive and iterative greedy approaches. Any optimal scheduling schemes show better performance on heterogeneous resources using CPU or GPU.


Subjects
Algorithms, Computational Biology/methods, Sequence Alignment/methods, Computational Biology/statistics & numerical data, Humans, Sequence Alignment/statistics & numerical data, Software
8.
Int J Mol Sci ; 23(3)2022 Jan 28.
Article in English | MEDLINE | ID: mdl-35163439

ABSTRACT

The presence of protein structures with atypical folds in the Protein Data Bank (PDB) is rare and may result from naturally occurring knots or crystallographic errors. Proper characterisation of such folds is imperative to understanding the basis of naturally existing knots and correcting crystallographic errors. If left uncorrected, such errors can frustrate downstream experiments that depend on the structures containing them. An atypical fold has been identified in P. falciparum dihydrofolate reductase (PfDHFR) between residues 20-51 (loop 1) and residues 191-205 (loop 2). This enzyme is key to drug discovery efforts in the parasite, necessitating a thorough characterisation of these folds. Using multiple sequence alignments (MSA), a unique insert was identified in loop 1 that exacerbates the appearance of the atypical fold, giving it a slipknot-like topology. However, PfDHFR has not been deposited in the knotted proteins database, and processing its structure failed to identify any knots within its folds. The application of protein homology modelling and molecular dynamics simulations on the DHFR domain of P. falciparum and those of two other organisms (E. coli and M. tuberculosis) that were used as molecular replacement templates in solving the PfDHFR structure revealed plausible unentangled or open conformations of these loops. These results will serve as guides for crystallographic experiments to provide further insights into the atypical folds identified.


Subjects
Plasmodium falciparum/enzymology, Sequence Alignment/methods, Tetrahydrofolate Dehydrogenase/chemistry, Tetrahydrofolate Dehydrogenase/genetics, X-Ray Crystallography, Protein Databases, Molecular Models, Molecular Dynamics Simulation, Plasmodium falciparum/genetics, Protein Conformation, Protein Domains, Protein Folding, Protozoan Proteins/chemistry, Protozoan Proteins/genetics, Protein Sequence Analysis, Amino Acid Sequence Homology
9.
PLoS One ; 17(2): e0261103, 2022.
Article in English | MEDLINE | ID: mdl-35196314

ABSTRACT

A variety of islet autoantibodies (AAbs) can predict and possibly dictate eventual type 1 diabetes (T1D) diagnosis. Upwards of 75% of those with T1D are positive for AAbs against glutamic acid decarboxylase (GAD65 or GAD), a producer of gamma-aminobutyric acid (GABA) in human pancreatic beta cells. Interestingly, bacterial populations within the human gut also express GAD and produce GABA. Evidence suggests that dysbiosis of the microbiome may correlate with T1D pathogenesis and physiology. Therefore, autoimmune linkages between the gut microbiome and islets susceptible to autoimmune attack need to be further elucidated. Utilizing in silico analyses, we show that 25 GAD sequences from human gut bacterial sources show sequence and motif similarities to human beta cell GAD65. Our motif analyses determined that most gut GAD sequences contain the pyridoxal-dependent decarboxylase (PDD) domain of human GAD65, which is important for its enzymatic activity. Additionally, we showed overlap with known human GAD65 T cell receptor epitopes, which may implicate the immune destruction of beta cells. Thus, we propose a physiological hypothesis in which changes in the gut microbiome in those with T1D result in a release of bacterial GAD, thus causing miseducation of the host immune system. Due to the notable similarities we found between human and bacterial GAD, these deputized immune cells may then target human beta cells, leading to the development of T1D.


Subjects
Autoantibodies/immunology, Bacteria/enzymology, Type 1 Diabetes Mellitus/immunology, Type 1 Diabetes Mellitus/microbiology, Gastrointestinal Microbiome/immunology, Glutamate Decarboxylase/genetics, Glutamate Decarboxylase/immunology, Animals, Antigen-Presenting Cells/immunology, Computer Simulation, Type 1 Diabetes Mellitus/enzymology, T-Lymphocyte Epitopes/immunology, Bacterial Genes, Humans, Islets of Langerhans/enzymology, Islets of Langerhans/immunology, Mice, Pan troglodytes/microbiology, Phylogeny, Protein Domains, Sequence Alignment/methods, gamma-Aminobutyric Acid/metabolism
10.
Cell Rep ; 38(2): 110207, 2022 01 11.
Article in English | MEDLINE | ID: mdl-35021073

ABSTRACT

Understanding and predicting the functional consequences of single amino acid changes is central in many areas of protein science. Here, we collect and analyze experimental measurements of effects of >150,000 variants in 29 proteins. We use biophysical calculations to predict changes in stability for each variant and assess them in light of sequence conservation. We find that the sequence analyses give more accurate prediction of variant effects than predictions of stability and that about half of the variants that show loss of function do so due to stability effects. We construct a machine learning model to predict variant effects from protein structure and sequence alignments and show how the two sources of information support one another and enable mechanistic interpretations. Together, our results show how one can leverage large-scale experimental assessments of variant effects to gain deeper and general insights into the mechanisms that cause loss of function.


Subjects
Forecasting/methods, Protein Stability, DNA Sequence Analysis/methods, Amino Acid Substitution, Animals, Computational Biology/methods, Humans, Machine Learning, Mutation/genetics, Mutation/physiology, Proteins/metabolism, Sequence Alignment/methods
11.
Proc Natl Acad Sci U S A ; 119(5)2022 02 01.
Article in English | MEDLINE | ID: mdl-35091471

ABSTRACT

We report two structures of the human voltage-gated potassium channel (Kv) Kv1.3 in immune cells alone (apo-Kv1.3) and bound to an immunomodulatory drug called dalazatide (dalazatide-Kv1.3). Both the apo-Kv1.3 and dalazatide-Kv1.3 structures are in an activated state based on their depolarized voltage sensor and open inner gate. In apo-Kv1.3, the aromatic residue in the signature sequence (Y447) adopts a position that diverges 11 Å from other K+ channels. The outer pore is significantly rearranged, causing widening of the selectivity filter and perturbation of ion binding within the filter. This conformation is stabilized by a network of intrasubunit hydrogen bonds. In dalazatide-Kv1.3, binding of dalazatide to the channel's outer vestibule narrows the selectivity filter, Y447 occupies a position seen in other K+ channels, and this conformation is stabilized by a network of intersubunit hydrogen bonds. These remarkable rearrangements in the selectivity filter underlie Kv1.3's transition into the drug-blocked state.


Subjects
Kv1.3 Potassium Channel/metabolism, Kv1.3 Potassium Channel/ultrastructure, Amino Acid Sequence/genetics, Binding Sites/physiology, Humans, Ion Channel Gating/physiology, Kv1.3 Potassium Channel/drug effects, Membrane Potentials, Electron Microscopy/methods, Molecular Models, Molecular Conformation, Potassium/metabolism, Potassium Channels/metabolism, Potassium Channels/ultrastructure, Voltage-Gated Potassium Channels/metabolism, Voltage-Gated Potassium Channels/ultrastructure, Sequence Alignment/methods
12.
PLoS Comput Biol ; 18(1): e1009802, 2022 01.
Article in English | MEDLINE | ID: mdl-35073327

ABSTRACT

Long-read-only bacterial genome assemblies usually contain residual errors, most commonly homopolymer-length errors. Short-read polishing tools can use short reads to fix these errors, but most rely on short-read alignment which is unreliable in repeat regions. Errors in such regions are therefore challenging to fix and often remain after short-read polishing. Here we introduce Polypolish, a new short-read polisher which uses all-per-read alignments to repair errors in repeat sequences that other polishers cannot. Polypolish performed well in benchmarking tests using both simulated and real reads, and it almost never introduced errors during polishing. The best results were achieved by using Polypolish in combination with other short-read polishers.


Subjects
Bacterial Genome/genetics, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Sequence Alignment/methods, DNA Sequence Analysis/methods, Bacterial DNA/genetics, Nucleic Acid Repetitive Sequences/genetics
13.
Nature ; 602(7895): 142-147, 2022 02.
Article in English | MEDLINE | ID: mdl-35082445

ABSTRACT

Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 10^5 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis delta virus and huge phages, respectively, and analysed their environmental reservoirs. To catalyse the ongoing revolution of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.


Subjects
Cloud Computing, Genetic Databases, RNA Viruses/genetics, RNA Viruses/isolation & purification, Sequence Alignment/methods, Virology/methods, Virome/genetics, Animals, Archives, Bacteriophages/enzymology, Bacteriophages/genetics, Biodiversity, Coronavirus/classification, Coronavirus/enzymology, Coronavirus/genetics, Molecular Evolution, Hepatitis Delta Virus/enzymology, Hepatitis Delta Virus/genetics, Humans, Molecular Models, RNA Viruses/classification, RNA Viruses/enzymology, RNA-Dependent RNA Polymerase/chemistry, RNA-Dependent RNA Polymerase/genetics, Software
14.
Int J Mol Sci ; 22(23)2021 Nov 27.
Article in English | MEDLINE | ID: mdl-34884640

ABSTRACT

The field of protein structure prediction has recently been revolutionized through the introduction of deep learning. The current state-of-the-art tool AlphaFold2 can predict highly accurate structures; however, it has a prohibitively long inference time for applications that require the folding of hundreds of sequences. The prediction of protein structure annotations, such as amino acid distances, can be achieved at a higher speed with existing tools, such as the ProSPr network. Here, we report on important updates to the ProSPr network, its performance in the recent Critical Assessment of Techniques for Protein Structure Prediction (CASP14) competition, and an evaluation of its accuracy dependency on sequence length and multiple sequence alignment depth. We also provide a detailed description of the architecture and the training process, accompanied by reusable code. This work is anticipated to provide a solid foundation for the further development of protein distance prediction tools.


Subjects
Computer Neural Networks, Proteins/chemistry, Amino Acid Sequence, Computational Biology/methods, Humans, Protein Conformation, Protein Folding, Protein Structural Elements, Sequence Alignment/methods, Software Design
15.
Proc Natl Acad Sci U S A ; 118(49)2021 12 07.
Article in English | MEDLINE | ID: mdl-34873061

ABSTRACT

Information derived from metagenome sequences through deep-learning techniques has significantly improved the accuracy of template free protein structure modeling. However, most of the deep learning-based modeling studies are based on blind sequence database searches and suffer from low efficiency in computational resource utilization and model construction, especially when the sequence library becomes prohibitively large. We proposed a MetaSource model built on 4.25 billion microbiome sequences from four major biomes (Gut, Lake, Soil, and Fermentor) to decode the inherent linkage of microbial niches with protein homologous families. Large-scale protein family folding experiments on 8,700 unknown Pfam families showed that a microbiome targeted approach with multiple sequence alignment constructed from individual MetaSource biomes requires more than threefold less computer memory and CPU (central processing unit) time but generates contact-map and three-dimensional structure models with a significantly higher accuracy, compared with that using combined metagenome datasets. These results demonstrate an avenue to bridge the gap between the rapidly increasing metagenome databases and the limited computing resources for efficient genome-wide database mining, which provides a useful bluebook to guide future microbiome sequence database and modeling development for high-accuracy protein structure and function prediction.


Subjects
Microbiota/genetics, Sequence Alignment/methods, Protein Sequence Analysis/methods, Algorithms, Computational Biology/methods, Protein Databases, Deep Learning, Ecosystem, Molecular Evolution, Humans, Metagenome/genetics, Computer Neural Networks, Protein Conformation, Protein Folding, Proteins/chemistry, Sequence Homology
16.
Nat Commun ; 12(1): 6302, 2021 11 02.
Article in English | MEDLINE | ID: mdl-34728624

ABSTRACT

Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the "generative capacity" of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model's generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE's lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.


Subjects
Mutation, Proteins/chemistry, Sequence Alignment/methods, Algorithms, Amino Acid Sequence, Computer Simulation, Protein Databases, Humans, Statistical Models, Protein Structural Elements, Proteins/genetics, Structure-Activity Relationship
17.
Genes (Basel) ; 12(11)2021 11 18.
Article in English | MEDLINE | ID: mdl-34828415

ABSTRACT

Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, making it suitable for a large number of long viral genomic sequences.


Subjects
Laboratory Automation/methods, Genomics/methods, Sequence Alignment/methods, Algorithms, Animals, Laboratory Automation/standards, Genomics/standards, Humans, Phylogeny, Sensitivity and Specificity, Sequence Alignment/standards, DNA Sequence Analysis/methods, DNA Sequence Analysis/standards, Protein Sequence Analysis/methods, Protein Sequence Analysis/standards
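
Entry 17 proposes the position weight matrix (PWM) as a tool for refining an MSA. The sketch below shows only the generic building block: deriving a log-odds PWM from the columns of an alignment and scoring a candidate segment against it. The refinement procedure itself is not described in the abstract and is not reproduced here. The uniform background, the pseudocount of 1, and the toy alignment are assumptions chosen for illustration.

```python
import math
from collections import Counter


def position_weight_matrix(aligned, alphabet="ACGT", pseudocount=1.0):
    """Log2-odds PWM from equal-length aligned sequences (uniform background)."""
    length = len(aligned[0])
    background = 1.0 / len(alphabet)
    pwm = []
    for i in range(length):
        # Gap characters and unknown symbols are excluded from the column counts.
        counts = Counter(seq[i] for seq in aligned if seq[i] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pwm.append({
            sym: math.log2((counts[sym] + pseudocount) / total / background)
            for sym in alphabet
        })
    return pwm


def pwm_score(pwm, segment):
    """Sum of per-column scores; gaps or unknown symbols contribute 0."""
    return sum(column.get(sym, 0.0) for column, sym in zip(pwm, segment))


# Hypothetical four-sequence nucleotide alignment with one gap.
toy_alignment = ["ACGT", "ACGA", "ACGT", "AC-T"]
pwm = position_weight_matrix(toy_alignment)

consensus_score = pwm_score(pwm, "ACGT")
divergent_score = pwm_score(pwm, "TGCA")
```

Because every column of the matrix draws on all sequences in the alignment, a segment that matches the column-wise consensus scores higher than one that does not, which is the property the abstract contrasts with pairwise-score-based methods.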
18.
J Comput Biol ; 28(11): 1063-1074, 2021 11.
Article in English | MEDLINE | ID: mdl-34665648

ABSTRACT

The functional profile of metagenomic samples enables improved understanding of microbial populations in the environment. Such analysis consists of assigning short sequencing reads to a particular functional category. Normally, manually curated databases are used for functional assignment, and genes are arranged into different classes. Sequence alignment has been widely used to profile metagenomic samples against curated databases. However, this method is time consuming and requires high computational resources. While several alignment-free methods based on k-mer composition have been developed in recent years, they still require large amounts of computer main memory. In this article, MetaMLP (Metagenomics Machine Learning Profiler), a machine learning method that represents sequences as numerical vectors (embeddings) and uses a simple one hidden layer neural network to profile functional categories, is developed. Unlike other methods, MetaMLP enables partial matching by using a reduced alphabet to build sequence embeddings from full and partial k-mers. MetaMLP is able to identify a slightly larger number of reads compared with DIAMOND (one of the fastest sequence alignment methods), as well as to perform accurate predictions with 0.99 precision and 0.99 recall. MetaMLP can process 100M reads in ∼10 minutes on a laptop computer, which is 50 times faster than DIAMOND.


Subjects
Computational Biology/methods, Metagenomics/methods, Sequence Alignment/methods, Algorithms, Data Curation, Genetic Databases, Machine Learning, DNA Sequence Analysis
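
Entry 18 mentions building sequence embeddings from k-mers over a reduced alphabet so that partial matches become possible. The sketch below illustrates that idea only: amino acids are collapsed into a few classes before k-mers are extracted, so peptides that differ by similar residues yield the same k-mer. The grouping shown is a hypothetical one based on broad physicochemical classes and is not the alphabet used by MetaMLP; the embedding and neural network steps are omitted.

```python
# Hypothetical grouping of the 20 amino acids into six classes.
REDUCED_GROUPS = {
    "AVLIMC": "h",  # hydrophobic
    "FWYH": "a",    # aromatic
    "STNQ": "p",    # polar, uncharged
    "KR": "k",      # positively charged
    "DE": "d",      # negatively charged
    "GP": "g",      # glycine and proline
}
REDUCE = {aa: code for group, code in REDUCED_GROUPS.items() for aa in group}


def reduce_sequence(seq):
    """Map each residue to its class code; unknown residues become 'x'."""
    return "".join(REDUCE.get(aa, "x") for aa in seq.upper())


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Two different peptides collapse to the same reduced 3-mer.
first = kmers(reduce_sequence("MKV"), 3)
second = kmers(reduce_sequence("LRI"), 3)
```

Here "MKV" and "LRI" share no residues, yet both reduce to the single k-mer "hkh", which is the kind of partial matching a reduced alphabet allows.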
19.
STAR Protoc ; 2(4): 100888, 2021 12 17.
Article in English | MEDLINE | ID: mdl-34704076

ABSTRACT

Annotating protein-coding genes can be challenging, especially when searching for the best hits against multiple functional databases. This is partly because of "bad words" appearing as top hits, such as hypothetical or uncharacterized proteins. To help alleviate some of these issues, we designed a bioinformatics tool called NoBadWordsCombiner, which efficiently merges the hits from various databases, strengthening gene definitions by minimizing functional descriptions containing "bad words." Unlike other available tools, NoBadWordsCombiner is user friendly, but it does require users to have some general bioinformatics skills, including a basic understanding of the BLAST package and dash shell in Linux/Unix environments. For complete details on the use and execution of this protocol, please refer to Zhang et al. (2021a).


Subjects
Computational Biology/methods, Genetic Databases, Molecular Sequence Annotation, Sequence Alignment/methods, Software, Animals, Humans, Mice, Molecular Sequence Annotation/methods, Proteins/genetics
20.
PLoS Comput Biol ; 17(10): e1009541, 2021 10.
Article in English | MEDLINE | ID: mdl-34714829

ABSTRACT

We have developed the program TwinCons, to detect noisy signals of deep ancestry of proteins or nucleic acids. As input, the program uses a composite alignment containing pre-defined groups, and mathematically determines a 'cost' of transforming one group to the other at each position of the alignment. The output distinguishes conserved, variable and signature positions. A signature is conserved within groups but differs between groups. The method automatically detects continuous characteristic stretches (segments) within alignments. TwinCons provides a convenient representation of conserved, variable and signature positions as a single score, enabling the structural mapping and visualization of these characteristics. Structure is more conserved than sequence. TwinCons highlights alternative sequences of conserved structures. Using TwinCons, we detected highly similar segments between proteins from the translation and transcription systems. TwinCons detects conserved residues within regions of high functional importance for the ribosomal RNA (rRNA) and demonstrates that signatures are not confined to specific regions but are distributed across the rRNA structure. The ability to evaluate both nucleic acid and protein alignments allows TwinCons to be used in combined sequence and structural analysis of signatures and conservation in rRNA and in ribosomal proteins (rProteins). TwinCons detects a strong sequence conservation signal between bacterial and archaeal rProteins related by circular permutation. This conserved sequence is structurally colocalized with conserved rRNA, indicated by TwinCons scores of rRNA alignments of bacterial and archaeal groups. This combined analysis revealed deep co-evolution of rRNA and rProtein buried within the deepest branching points in the tree of life.


Subjects
Conserved Sequence/genetics, Deep Learning, Ribosomal RNA/genetics, Sequence Alignment/methods, Protein Sequence Analysis/methods, Archaeal Proteins/chemistry, Archaeal Proteins/genetics, Bacterial Proteins/chemistry, Bacterial Proteins/genetics, Molecular Evolution, Metagenomics
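
Entry 20 distinguishes conserved, variable and signature positions, where a signature is conserved within each group but differs between groups. The sketch below applies that definition literally to a two-group alignment and returns a categorical label per column. It is a simplification for illustration: TwinCons itself reports a single continuous score derived from a transformation 'cost', which is not reproduced here, and the two toy groups are hypothetical.

```python
def classify_columns(group_a, group_b):
    """Label each alignment column as conserved, signature, or variable."""
    labels = []
    # zip(*group) transposes a list of equal-length sequences into columns.
    for column_a, column_b in zip(zip(*group_a), zip(*group_b)):
        residues_a, residues_b = set(column_a), set(column_b)
        if len(residues_a) == 1 and len(residues_b) == 1:
            # Invariant within both groups: same residue means conserved,
            # different residues means signature.
            labels.append("conserved" if residues_a == residues_b else "signature")
        else:
            labels.append("variable")
    return labels


# Hypothetical pre-defined groups from one composite alignment.
group_a = ["MKLV", "MKLI", "MKLV"]
group_b = ["MRLA", "MRLV", "MRLT"]

labels = classify_columns(group_a, group_b)
```

In this example the second column is a signature position: it is always K in the first group and always R in the second.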